Mean Gamma Deviance (mean_gamma_deviance) — Regression Metric (From Scratch)#

Mean Gamma deviance is a scale-invariant regression loss for strictly positive targets. It measures error through the ratio between the true value and the prediction, making it a strong fit when relative errors matter (and variance tends to grow with the square of the mean).

Goals

  • Build intuition with small numeric examples + Plotly visuals

  • Derive the formula + key properties (non-negativity, scale invariance, asymmetry)

  • Implement mean_gamma_deviance in NumPy (from scratch) and validate vs scikit-learn

  • Use the loss to fit a simple Gamma regression (GLM) with gradient descent (low-level NumPy)

  • Summarize pros/cons, good use cases, and common pitfalls

Quick import#

import os

import numpy as np
import pandas as pd
import plotly
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import sklearn
from sklearn.linear_model import GammaRegressor
from sklearn.metrics import mean_gamma_deviance
from sklearn.model_selection import train_test_split

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

rng = np.random.default_rng(7)
np.set_printoptions(precision=4, suppress=True)

print("numpy  :", np.__version__)
print("pandas :", pd.__version__)
print("sklearn:", sklearn.__version__)
print("plotly :", plotly.__version__)
numpy  : 1.26.2
pandas : 2.1.3
sklearn: 1.6.0
plotly : 6.5.2

Prerequisites#

  • Regression setup: targets \(y\) and predictions \(\hat y\)

  • Comfort with logs and ratios

  • (Optional) basic derivatives for the gradient / optimization section

Domain constraints

  • Requires strictly positive values: \(y_i > 0\) and \(\hat y_i > 0\).

1) Definition and notation#

Let:

  • \(y \in \mathbb{R}_{>0}^n\) be the true targets

  • \(\hat y \in \mathbb{R}_{>0}^n\) be the predictions

The Gamma deviance per sample is:

\[ d_i(y_i, \hat y_i) = 2\left(\log\frac{\hat y_i}{y_i} + \frac{y_i}{\hat y_i} - 1\right) \]

The mean Gamma deviance is:

\[ \mathrm{MGD}(y, \hat y) = \frac{1}{n}\sum_{i=1}^n d_i(y_i, \hat y_i) \]

In scikit-learn, mean_gamma_deviance is the Tweedie deviance with power \(p=2\).
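A quick way to check this equivalence is to compare against mean_tweedie_deviance with power=2 (a small sanity check; it assumes your scikit-learn version exposes mean_tweedie_deviance, and the example values are arbitrary):

# Sanity check: Gamma deviance == Tweedie deviance with power=2
from sklearn.metrics import mean_tweedie_deviance

y_t = np.array([2.0, 1.5, 3.0])
y_p = np.array([1.8, 1.5, 3.5])

print("mean_gamma_deviance  :", mean_gamma_deviance(y_t, y_p))
print("mean_tweedie_deviance:", mean_tweedie_deviance(y_t, y_p, power=2))  # same value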

Where this comes from (deviance / likelihood view)#

In generalized linear models (GLMs), a deviance compares your model to a “perfect” saturated model:

\[ D(y, \mu) = 2\sum_{i=1}^n \big[\ell(y_i; y_i) - \ell(y_i; \mu_i)\big] \]

For the Gamma family (ignoring dispersion constants), the negative log-likelihood for one sample can be written as:

\[ -\ell(y; \mu) = \log \mu + \frac{y}{\mu} + \text{const} \]

So the deviance contribution is:

\[ d(y, \mu) = 2\Big[(\log \mu + y/\mu) - (\log y + 1)\Big] = 2\left(\log\frac{\mu}{y} + \frac{y}{\mu} - 1\right) \]

This is exactly the formula used by mean_gamma_deviance.

Ratio form (why it measures relative error)#

Define the ratio

\[ r_i = \frac{y_i}{\hat y_i} \quad (>0) \]

Then:

\[ d_i = 2\left(r_i - \log r_i - 1\right) \]

So the loss depends only on the ratio \(y/\hat y\) (a quick numeric check follows this list):

  • scaling both \(y\) and \(\hat y\) by the same constant leaves \(r\) unchanged → same deviance

  • a “2× overprediction” (\(\hat y = 2y\)) is penalized the same whether \(y=1\) or \(y=1000\)
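
A quick numeric check of both points (a small sketch; the specific values are arbitrary):

# Same multiplicative error (y_hat = 2*y) at two very different scales -> same deviance
print(mean_gamma_deviance(np.array([1.0]), np.array([2.0])))
print(mean_gamma_deviance(np.array([1000.0]), np.array([2000.0])))

# Rescaling y and y_hat by the same constant leaves the deviance unchanged
y_demo = np.array([2.0, 0.5, 4.0])
y_hat_demo = np.array([1.5, 0.7, 5.0])
print(mean_gamma_deviance(y_demo, y_hat_demo), mean_gamma_deviance(10 * y_demo, 10 * y_hat_demo))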

Non-negativity (and when it is 0)#

A key inequality for \(r>0\) (it holds because \(\log\) is concave and the line \(r-1\) is its tangent at \(r=1\)) is:

\[ \log r \le r - 1 \]

Rearranging:

\[ r - \log r - 1 \ge 0 \]

Therefore \(d_i \ge 0\) for every sample, and:

  • \(d_i = 0\) iff \(r_i = 1\) iff \(y_i = \hat y_i\)

2) Intuition: shape of the penalty#

Asymmetry#

Because \(d_i\) depends on the ratio \(r=y/\hat y\):

  • underprediction (\(\hat y < y\)) gives \(r>1\) and grows roughly linearly in \(r\)

  • overprediction (\(\hat y > y\)) gives \(r<1\) and grows roughly like \(-\log r\)

So big underpredictions are typically penalized more strongly than equally-large overpredictions (in multiplicative terms).

Small-error approximation#

Let the relative error be

\[ \varepsilon = \frac{y-\hat y}{\hat y} \quad\Rightarrow\quad r = \frac{y}{\hat y} = 1+\varepsilon \]

Using \(\log(1+\varepsilon) \approx \varepsilon - \varepsilon^2/2\) for small \(\varepsilon\):

\[ \frac{d}{2} = r - \log r - 1 \approx (1+\varepsilon) - (\varepsilon - \varepsilon^2/2) - 1 = \varepsilon^2/2 \]

So for small errors:

\[ d \approx \varepsilon^2 = \left(\frac{y-\hat y}{\hat y}\right)^2 \]

Interpretation: Gamma deviance behaves like squared relative error near the optimum.
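
A quick numerical illustration of this approximation (a small sketch; the epsilon values are arbitrary):

# Exact Gamma deviance vs the squared-relative-error approximation for small errors
eps_demo = np.array([0.01, 0.05, 0.1, 0.2])  # relative errors (y - y_hat) / y_hat
r_demo = 1 + eps_demo                        # corresponding ratios y / y_hat

exact = 2 * (r_demo - np.log(r_demo) - 1)
approx = eps_demo ** 2

print(pd.DataFrame({"epsilon": eps_demo, "exact d": exact, "approx eps^2": approx}))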

# Visualize d(r) = 2 * (r - log r - 1)
r = np.logspace(-2, 2, 600)  # r = y / y_hat

d = 2 * (r - np.log(r) - 1)

fig = px.line(
    x=r,
    y=d,
    log_x=True,
    title="Gamma deviance as a function of ratio r = y / ŷ",
    labels={"x": "r = y / ŷ (log scale)", "y": "d(r)"},
)
fig.add_vline(x=1.0, line_dash="dash", line_color="black")
fig.update_layout(showlegend=False)
fig.show()
# A few multiplicative errors side-by-side
ratios = np.array([0.1, 0.5, 1.0, 2.0, 10.0])  # r = y / y_hat
contrib = 2 * (ratios - np.log(ratios) - 1)

pd.DataFrame({"r = y/ŷ": ratios, "d(r)": contrib}).style.format({"d(r)": "{:.4f}"})
    r = y/ŷ     d(r)
0  0.100000   2.8052
1  0.500000   0.3863
2  1.000000   0.0000
3  2.000000   0.6137
4 10.000000  13.3948

3) A tiny worked example#

We’ll compute mean Gamma deviance step-by-step and compare to scikit-learn.

y_true = np.array([2.0, 0.5, 1.0, 4.0])
y_pred = np.array([0.5, 0.5, 2.0, 2.0])

per_sample = 2 * (np.log(y_pred / y_true) + y_true / y_pred - 1)

df = pd.DataFrame(
    {
        "y_true": y_true,
        "y_pred": y_pred,
        "ratio r=y/ŷ": y_true / y_pred,
        "per-sample deviance d_i": per_sample,
    }
)

mgd_np = float(per_sample.mean())
mgd_sklearn = mean_gamma_deviance(y_true, y_pred)

(df.style.format({"ratio r=y/ŷ": "{:.3f}", "per-sample deviance d_i": "{:.4f}"}), mgd_np, mgd_sklearn)
(<pandas.io.formats.style.Styler at 0x796358e94da0>,
 1.0568528194400546,
 1.0568528194400546)

4) NumPy implementation (from scratch)#

A minimal implementation is just the formula plus:

  • shape checks

  • positivity checks (or optional clipping)

  • optional sample weights

def mean_gamma_deviance_np(y_true, y_pred, sample_weight=None, *, eps=0.0):
    '''Mean Gamma deviance (NumPy).

    Matches scikit-learn's definition:
        d_i = 2 * (log(y_pred / y_true) + y_true / y_pred - 1)
        mean = average(d_i, weights=sample_weight)

    Notes
    -----
    - Requires y_true > 0 and y_pred > 0.
    - If eps > 0, values are clipped to [eps, +inf) to avoid log/div-by-zero.
    '''

    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    if y_true.shape != y_pred.shape:
        raise ValueError(f"Shape mismatch: y_true{y_true.shape} vs y_pred{y_pred.shape}")

    if eps > 0:
        y_true = np.clip(y_true, eps, None)
        y_pred = np.clip(y_pred, eps, None)

    if np.any(y_true <= 0) or np.any(y_pred <= 0):
        raise ValueError("mean_gamma_deviance requires strictly positive y_true and y_pred")

    dev = 2 * (np.log(y_pred / y_true) + y_true / y_pred - 1)
    return float(np.average(dev, weights=sample_weight))
# Validate vs scikit-learn (including sample weights)
y_true = rng.lognormal(mean=0.0, sigma=0.8, size=1_000)
y_pred = rng.lognormal(mean=0.1, sigma=0.8, size=1_000)
weights = rng.uniform(0.5, 2.0, size=1_000)

print("unweighted np     :", mean_gamma_deviance_np(y_true, y_pred))
print("unweighted sklearn:", mean_gamma_deviance(y_true, y_pred))
print()
print("weighted np       :", mean_gamma_deviance_np(y_true, y_pred, sample_weight=weights))
print("weighted sklearn  :", mean_gamma_deviance(y_true, y_pred, sample_weight=weights))
unweighted np     : 1.378436722535223
unweighted sklearn: 1.378436722535223

weighted np       : 1.3822041665484144
weighted sklearn  : 1.3822041665484144

5) Relative vs absolute metrics (scale invariance in one picture)#

Consider a fixed multiplicative error: \(\hat y = c\,y\).

  • Gamma deviance depends only on \(r=y/\hat y = 1/c\), which is constant across scales.

  • Squared error \((y-\hat y)^2\) grows with the scale of \(y\).

y = np.logspace(0, 3, 80)  # 1 .. 1000
c = 1.5  # 50% overprediction

y_hat = c * y

per_sample_gamma = 2 * (np.log(y_hat / y) + y / y_hat - 1)
per_sample_sq = (y - y_hat) ** 2

fig = go.Figure()
fig.add_trace(go.Scatter(x=y, y=per_sample_gamma, mode="lines", name="Gamma deviance (per-sample)"))
fig.add_trace(
    go.Scatter(
        x=y,
        y=per_sample_sq,
        mode="lines",
        name="Squared error (per-sample)",
        yaxis="y2",
    )
)

fig.update_layout(
    title="Same relative error (ŷ = 1.5·y) across scales",
    xaxis=dict(title="y (log scale)", type="log"),
    yaxis=dict(title="Gamma deviance (scale-invariant)"),
    yaxis2=dict(title="Squared error (scale-dependent)", overlaying="y", side="right", type="log"),
    legend=dict(x=0.02, y=0.98),
)
fig.show()

6) Gradients (useful for optimization)#

For a single sample:

\[ d(y, \hat y) = 2\left(\log\frac{\hat y}{y} + \frac{y}{\hat y} - 1\right) \]

Derivative w.r.t. the prediction \(\hat y\):

\[ \frac{\partial d}{\partial \hat y} = 2\left(\frac{1}{\hat y} - \frac{y}{\hat y^2}\right) = 2\,\frac{\hat y - y}{\hat y^2} \]
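
A quick finite-difference check of this derivative (a small self-contained sketch; y0 and mu0 are arbitrary positive values):

# Central-difference check of d'(y_hat) = 2 * (y_hat - y) / y_hat**2
def gamma_dev(y, mu):
    return 2 * (np.log(mu / y) + y / mu - 1)

y0, mu0, h = 3.0, 2.0, 1e-6

analytic = 2 * (mu0 - y0) / mu0 ** 2
numeric = (gamma_dev(y0, mu0 + h) - gamma_dev(y0, mu0 - h)) / (2 * h)

print(analytic, numeric)  # the two values should agree to several decimal places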

7) Using mean Gamma deviance to optimize a model (Gamma regression)#

A common way to build a model that always predicts positive values is a log link:

\[ \eta_i = x_i^\top w \qquad\Rightarrow\qquad \mu_i = \exp(\eta_i) \]

We fit \(w\) by minimizing mean Gamma deviance:

\[ J(w) = \frac{1}{n}\sum_{i=1}^n 2\left(\log\frac{\mu_i}{y_i} + \frac{y_i}{\mu_i} - 1\right) \]

Using the gradient w.r.t. \(\eta\):

\[ \frac{\partial d_i}{\partial \eta_i} = 2\left(1 - \frac{y_i}{\mu_i}\right) \]

and \(\eta = Xw\), the gradient is:

\[ \nabla_w J(w) = \frac{2}{n} X^\top\left(\mathbf{1} - \frac{y}{\mu}\right) \]

We’ll implement gradient descent and fit a toy dataset generated from a Gamma model.
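
Before fitting, here is a quick finite-difference sanity check of the gradient formula on a tiny throwaway problem (a sketch; it uses its own local RNG so the global rng stream used for the dataset below is not disturbed):

# Check grad J(w) = (2/n) X^T (1 - y/mu) against central differences
check_rng = np.random.default_rng(0)

X_c = check_rng.normal(size=(5, 3))
y_c = check_rng.gamma(2.0, 1.0, size=5)  # strictly positive targets
w_c = 0.1 * check_rng.normal(size=3)


def J(w):
    mu = np.exp(X_c @ w)
    return np.mean(2 * (np.log(mu / y_c) + y_c / mu - 1))


mu_c = np.exp(X_c @ w_c)
grad_analytic = (2.0 / len(y_c)) * (X_c.T @ (1.0 - y_c / mu_c))

h = 1e-6
grad_numeric = np.array(
    [(J(w_c + h * np.eye(3)[j]) - J(w_c - h * np.eye(3)[j])) / (2 * h) for j in range(3)]
)

print("analytic:", grad_analytic)
print("numeric :", grad_numeric)  # should match the analytic gradient closely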

# Synthetic Gamma-regression dataset (variance grows ~ mean^2)
n = 600
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

X_raw = np.column_stack([x1, x2])
X_raw = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)  # simple standardization

X = np.column_stack([np.ones(n), X_raw])  # add intercept

w_true = np.array([0.25, 0.9, -0.6])
eta_true = X @ w_true
mu_true = np.exp(eta_true)

shape_k = 6.0  # larger k => less noise (var = mu^2 / k)
y = rng.gamma(shape_k, scale=mu_true / shape_k)

(mu_true.min(), mu_true.mean(), mu_true.max(), y.min(), y.mean(), y.max())
(0.03858702982088678,
 2.3581594109476653,
 52.632742976010974,
 0.03697952466385566,
 2.2560484806999526,
 67.10897917526945)
# Visualize heteroscedasticity (log-log scatter)
line = np.logspace(np.log10(mu_true.min()), np.log10(mu_true.max()), 200)

fig = go.Figure()
fig.add_trace(go.Scatter(x=mu_true, y=y, mode="markers", name="samples", marker=dict(opacity=0.55)))
fig.add_trace(go.Scatter(x=line, y=line, mode="lines", name="y = μ", line=dict(color="black", dash="dash")))

fig.update_layout(
    title="Synthetic data: y ~ Gamma(mean=μ) (variance increases with μ)",
    xaxis=dict(title="true mean μ", type="log"),
    yaxis=dict(title="observed y", type="log"),
)
fig.show()
X_train, X_test, y_train, y_test, mu_train, mu_test = train_test_split(
    X, y, mu_true, test_size=0.25, random_state=0
)


def fit_gamma_regression_gd(X, y, *, lr=0.05, n_iter=2500, clip_eta=20.0):
    # Fit log-link Gamma regression by minimizing mean Gamma deviance.
    n, d = X.shape
    w = np.zeros(d)
    history = np.zeros(n_iter)

    for t in range(n_iter):
        eta = np.clip(X @ w, -clip_eta, clip_eta)
        mu = np.exp(eta)

        history[t] = mean_gamma_deviance_np(y, mu)

        # grad = (2/n) * X^T (1 - y/mu)
        grad = (2.0 / n) * (X.T @ (1.0 - y / mu))
        w -= lr * grad

    return w, history


w_gd, history = fit_gamma_regression_gd(X_train, y_train, lr=0.05, n_iter=2500)

print("w_true:", w_true)
print("w_gd  :", w_gd)
print()
print("train MGD:", mean_gamma_deviance_np(y_train, np.exp(np.clip(X_train @ w_gd, -20, 20))))
print("test  MGD:", mean_gamma_deviance_np(y_test, np.exp(np.clip(X_test @ w_gd, -20, 20))))
w_true: [ 0.25  0.9  -0.6 ]
w_gd  : [ 0.2218  0.8851 -0.5517]

train MGD: 0.16124082232077153
test  MGD: 0.18385887160506967
fig = px.line(
    x=np.arange(len(history)),
    y=history,
    title="Training mean Gamma deviance during gradient descent",
    labels={"x": "iteration", "y": "mean_gamma_deviance"},
)
fig.show()
mu_pred_test = np.exp(np.clip(X_test @ w_gd, -20, 20))

line = np.logspace(np.log10(mu_test.min()), np.log10(mu_test.max()), 200)

fig = go.Figure()
fig.add_trace(go.Scatter(x=mu_test, y=mu_pred_test, mode="markers", name="pred", marker=dict(opacity=0.6)))
fig.add_trace(go.Scatter(x=line, y=line, mode="lines", name="ideal", line=dict(color="black", dash="dash")))

fig.update_layout(
    title="Recovered mean μ on test set (gradient descent)",
    xaxis=dict(title="true μ", type="log"),
    yaxis=dict(title="predicted μ", type="log"),
)
fig.show()
# Compare to scikit-learn's GammaRegressor
# (it includes an intercept internally, so we pass features without the intercept column)

gr = GammaRegressor(alpha=0.0, max_iter=5000)
gr.fit(X_train[:, 1:], y_train)

mu_pred_sk = gr.predict(X_test[:, 1:])

print("sklearn [intercept, coef]:", np.r_[gr.intercept_, gr.coef_])
print("gd      w             :", w_gd)
print()
print("test MGD (gd)     :", mean_gamma_deviance(y_test, mu_pred_test))
print("test MGD (sklearn):", mean_gamma_deviance(y_test, mu_pred_sk))
sklearn [intercept, coef]: [ 0.2218  0.8851 -0.5517]
gd      w             : [ 0.2218  0.8851 -0.5517]

test MGD (gd)     : 0.18385887160506967
test MGD (sklearn): 0.18385841689188462

8) Pros, cons, and when to use mean Gamma deviance#

Pros#

  • Scale-invariant / relative: depends on \(y/\hat y\), so it treats multiplicative errors consistently across scales.

  • Well-matched to Gamma-like data (strictly positive, heteroscedastic with \(\mathrm{Var}(y) \propto \mu^2\)).

  • For log-link linear models (\(\mu = \exp(Xw)\)), the objective is convex in \(w\) (good optimization properties).

Cons#

  • Requires strictly positive \(y\) and \(\hat y\) (zeros/negatives break the log / division).

  • Can explode when predictions get very close to 0 (huge penalty).

  • Not in the same units as \(y\) (less directly interpretable than MAE/RMSE).

  • Asymmetric in multiplicative error: large underpredictions tend to be punished more than equally-large overpredictions.

Good use cases#

  • Modeling positive continuous quantities where relative error matters: insurance claim severity, costs/prices, rainfall amounts, time-to-complete, demand with strictly positive values.

9) Common pitfalls and diagnostics#

  • Zeros in the target: Gamma deviance requires \(y>0\).

    • If you have many zeros alongside positive continuous values, consider a Tweedie model with \(1<p<2\) (compound Poisson-Gamma) or a two-part model; see the short check after this list.

  • Model outputs can go negative: if you use a plain linear model for \(\hat y\), it can produce invalid (≤0) predictions.

    • Use a positive link function (e.g., \(\exp(\cdot)\) or softplus) when training with Gamma deviance.

  • Numerical stability: clip predictions away from 0 (and sometimes clip the linear predictor before applying \(\exp\)).

  • Check relative residuals: since the loss is relative, inspect \((y-\hat y)/\hat y\) vs \(\hat y\).
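
To make the zero-target point concrete, here is a short check (a sketch; it assumes mean_tweedie_deviance is available, and the example values are arbitrary):

# Gamma deviance rejects zero targets; Tweedie deviance with 1 < p < 2 accepts them
from sklearn.metrics import mean_tweedie_deviance

y_with_zeros = np.array([0.0, 0.0, 1.2, 3.5])
y_hat_pos = np.array([0.3, 0.1, 1.0, 3.0])

try:
    mean_gamma_deviance(y_with_zeros, y_hat_pos)
except ValueError as err:
    print("Gamma deviance:", err)

print("Tweedie p=1.5 :", mean_tweedie_deviance(y_with_zeros, y_hat_pos, power=1.5))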

10) Exercises#

  1. Implement a weighted gradient descent version (use sample_weight) and verify it matches np.average weighting.

  2. Show empirically that, for small relative errors, Gamma deviance is close to squared relative error.

  3. Compare three approaches on the same positive dataset, then inspect which residual pattern looks most appropriate:

    • minimize MSE on \(y\)

    • minimize MSE on \(\log y\) (lognormal-ish)

    • minimize Gamma deviance (Gamma-ish)

References#

  • scikit-learn API: sklearn.metrics.mean_gamma_deviance

  • scikit-learn User Guide: Tweedie deviance (mean_tweedie_deviance)

  • Generalized Linear Models (GLMs) and Gamma regression (log link)